Appendices:

Appendix A: Data

Partitioning training and test data:

A set of regularly spaced coordinates was generated using regularCoordinates(12) in the
geosphere R package (Hijmans, Williams & Vennes 2012). The 1559 routes that were more
than 300 km from these coordinates were included in the training set, while the 280 routes
that were less than 150 km away were included in the test set.
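
The distance rule above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the analysis. The haversine helper, the function names, and the Earth radius are assumptions of the sketch, as is the reading that distance means distance to the nearest generated coordinate and that routes between the two thresholds belong to neither set.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_partition(route, centers, train_min_km=300.0, test_max_km=150.0):
    """Return 'train', 'test', or None for one route given as (lon, lat)."""
    nearest = min(haversine_km(route[0], route[1], c[0], c[1]) for c in centers)
    if nearest > train_min_km:
        return "train"
    if nearest < test_max_km:
        return "test"
    return None  # between the two thresholds: assigned to neither set
```

Leaving a gap between the two thresholds keeps every test route at least 150 km away from the nearest training route's exclusion boundary, which reduces spatial leakage between the sets.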

Criteria for including a species in the data set:

Hybrids and other ambiguous taxa were identified using simple heuristics (e.g. if the listed
common or Latin name included the word "or"). Such species were excluded from the
analysis; code for the heuristics used is available in the mistnet package (mistnet/inst/data
extraction/species-handling.R). I also omitted any species that were observed along fewer
than 10 training routes to ensure that enough data points were available for cross-validation.
This left a pool of 368 species for analysis.
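
The two inclusion criteria can be sketched as follows. This is a hedged Python illustration, not the heuristics in species-handling.R: only the "or" rule is taken from the text, and the function names and data layout are hypothetical.

```python
import re
from collections import Counter

# Match "or" only as a standalone word, so names such as "Oriole" are not flagged.
OR_WORD = re.compile(r"\bor\b", re.IGNORECASE)


def is_ambiguous(common_name, latin_name):
    """Flag taxa whose common or Latin name contains the standalone word 'or'."""
    return bool(OR_WORD.search(common_name) or OR_WORD.search(latin_name))


def select_species(species, training_observations, min_routes=10):
    """Keep unambiguous species seen on at least `min_routes` training routes.

    species: dict mapping species id -> (common_name, latin_name)
    training_observations: iterable of (route_id, species_id) pairs
    """
    # De-duplicate so each route counts at most once per species.
    routes_per_species = Counter(sp for _, sp in set(training_observations))
    return [
        sp for sp, (common, latin) in species.items()
        if not is_ambiguous(common, latin) and routes_per_species[sp] >= min_routes
    ]
```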

Included climate predictors:

The following eight climate predictors were selected out of the 19 Bioclim variables in the
Worldclim data set, as described in the main text:

bio2: Mean Diurnal Range
bio3: Isothermality
bio5: Max Temperature of Warmest Month
bio8: Mean Temperature of Wettest Quarter
bio9: Mean Temperature of Driest Quarter
bio15: Precipitation Seasonality
bio16: Precipitation of Wettest Quarter
bio18: Precipitation of Warmest Quarter

Appendix B: Model fitting

In an ordinary neural network, the log-likelihood gradient can be calculated for each coefficient
in the network using backpropagation (Murphy 2012). Backpropagation-based training
methods use a feed-forward step, where each species' probability of occurrence is calculated
by the model, and a second step where the errors are propagated backward through the
network using the chain rule from calculus to estimate the gradient of the log-likelihood
surface.
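
The two steps can be illustrated with a small sketch. The Python code below assumes a toy network with one hidden layer of sigmoid units, no bias terms, and a single species with a Bernoulli log-likelihood; the names and architecture are chosen for illustration and are not the mistnet implementation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, w2):
    """Feed-forward step: inputs -> hidden units -> occurrence probability."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    p = sigmoid(sum(w * hj for w, hj in zip(w2, h)))
    return h, p


def log_likelihood(y, p):
    """Bernoulli log-likelihood of presence/absence y given probability p."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)


def backward(x, y, W1, w2):
    """Backward step: the chain rule gives d(log-likelihood)/d(weight)."""
    h, p = forward(x, W1, w2)
    delta_out = y - p  # derivative with respect to the output unit's input
    grad_w2 = [delta_out * hj for hj in h]
    # Propagate the error back through each hidden unit's sigmoid.
    delta_hidden = [delta_out * w * hj * (1 - hj) for w, hj in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, grad_w2
```

A gradient-ascent update would then add a small multiple of each gradient entry to the corresponding weight.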

In mistnet models, the predictions, and thus the error gradients, depend on unobserved
environmental variation, which means that they cannot be calculated exactly. However, one can
collect Monte Carlo samples of possible error gradients, since we have prior distributions over
these unobserved variables (in mistnet, these prior distributions are all standard Gaussians).
Given the observed assemblage, one can therefore use an importance sampler to estimate the
posterior mean of the likelihood gradient (Tang & Salakhutdinov 2013). The mistnet code
then adjusts the model parameters in the direction of this estimated gradient.
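
A minimal sketch of such an estimator is given below. It assumes a self-normalized importance sampler that uses the standard Gaussian prior as its proposal distribution, so that each sample's weight is proportional to the likelihood of the observed assemblage. The function and argument names are hypothetical, and the code is a Python illustration, not the mistnet implementation.

```python
import math
import random


def importance_weighted_gradient(log_lik, grad, n_latent, n_samples, rng):
    """Self-normalized importance-sampling estimate of the posterior-mean gradient.

    log_lik(z): log-likelihood of the observed assemblage given latent values z
    grad(z):    list of log-likelihood derivatives (one per parameter) at z
    """
    # Proposal = prior: draw each latent variable from a standard Gaussian.
    zs = [[rng.gauss(0.0, 1.0) for _ in range(n_latent)] for _ in range(n_samples)]
    log_w = [log_lik(z) for z in zs]
    m = max(log_w)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    grads = [grad(z) for z in zs]
    n_params = len(grads[0])
    # Weighted average of the per-sample gradients.
    return [sum(wk * g[i] for wk, g in zip(w, grads)) / total for i in range(n_params)]
```

Samples of the latent variables that make the observed assemblage more likely receive larger weights, so their gradients dominate the update.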






Tang & Salakhutdinov (2013) show that this procedure is an example of generalized expectation
maximization (Neal & Hinton 1998), which means that it will increase a lower bound on the
model's log-likelihood until the lower bound coincides with a local maximum.

Alternative models

BRT: All these analyses used the gbm package. For each species, I evaluated BRT models
with an interaction.depth of 1, 2, 3, 5, or 8 and up to 10,000 trees, using
the default learning rate of 0.001. The number and depth of the trees were chosen by separate
5-fold cross-validations for each species [?].

Deterministic neural net: nnet's hyperparameters were optimized using random search
(Bergstra & Bengio 2012). During this search, the number of hidden units was sampled
uniformly between 1 and 50 and the weight decay was sampled from an exponential distribution
with rate parameter 1. The model was allowed 1,000 iterations to optimize the log-likelihood
in each configuration. The full search for an optimal model was continued for 15 hours, which
allowed the model to try 7 different configurations of hidden layer sizes and weight decay
values.
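
The sampling scheme for this search can be sketched as follows. The Python code is an illustration only: the keys size, decay, and maxit mirror the corresponding nnet arguments, while the evaluate callback (higher is better) and the fixed number of configurations are assumptions, since the actual search was limited by time and not by a configuration count.

```python
import random


def sample_nnet_config(rng):
    """Draw one configuration for the random search."""
    return {
        "size": rng.randint(1, 50),     # hidden units, uniform on 1..50 (inclusive)
        "decay": rng.expovariate(1.0),  # weight decay, exponential with rate 1
        "maxit": 1000,                  # iterations allowed per configuration
    }


def random_search(evaluate, n_configs, seed=0):
    """Try `n_configs` random configurations; return the best (score, config)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_configs):
        config = sample_nnet_config(rng)
        trials.append((evaluate(config), config))
    return max(trials, key=lambda t: t[0])
```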

BayesComm: I used a development version of BayesComm. This version can
be downloaded and installed with the devtools package using the following
command: install_github(username = "davharris", repo = "BayesComm", ref =
"63fc30773cf57f8c6411789da58ffbd3439b3e62").

I used the "full" model type, which models species' responses to both observed and latent
environmental factors. Over 15 hours, BayesComm performed 35,000 rounds of Gibbs
sampling, discarding the first 5,000 values as "burn-in" and thinning the remainder by 60.
This left 500 samples, which was a small enough number to still fit in 8 gigabytes of memory
with some room to spare for additional computations.
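
The retained sample count follows directly from the burn-in and thinning settings. A small Python check is shown below; which offset within each block of 60 is kept is an assumption of the sketch and does not affect the count.

```python
n_iterations = 35_000  # total rounds of Gibbs sampling
burn_in = 5_000        # initial rounds discarded
thin = 60              # keep every 60th round after burn-in

# Zero-based indices of the retained draws.
kept = range(burn_in, n_iterations, thin)
n_kept = len(kept)     # (35000 - 5000) / 60 retained samples
```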

mistnet: I tried 10 different hyperparameter configurations, chosen using random search
(Bergstra & Bengio 2012). During this search, I varied the following hyperparameters:



• three "prior" variances on the three layers of coefficients (drawn from exponentiated
  uniform distributions between 10^-3 and 10^-1)
• the size of the minibatches used (sampled uniformly between 10 and 100)
• the number of latent Gaussian variables to include in the model (sampled uniformly
  between 5 and 25)
• the number of importance samples to collect during each stage of gradient descent
  (sampled uniformly between 10 and 50)
• the number of nodes to include in each of the two hidden layers (sampled uniformly
  between 5 and 25 for the first layer, and between n and 50 for the second layer, where
  n is the number of hidden units in the first layer).
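
One draw from this search space can be sketched as follows. The Python code is illustrative and the dictionary keys are hypothetical names, not mistnet arguments. The variance bounds of 10^-3 and 10^-1 are this sketch's reading of the stated range, and integer-valued hyperparameters are assumed to be sampled with inclusive bounds.

```python
import random


def sample_mistnet_config(rng, log10_var_low=-3.0, log10_var_high=-1.0):
    """Draw one hyperparameter configuration from the ranges listed above."""
    n_hidden_1 = rng.randint(5, 25)
    return {
        # "Exponentiated uniform": uniform on the log scale, then exponentiated.
        "prior_variances": [
            10 ** rng.uniform(log10_var_low, log10_var_high) for _ in range(3)
        ],
        "minibatch_size": rng.randint(10, 100),
        "n_latent": rng.randint(5, 25),
        "n_importance_samples": rng.randint(10, 50),
        "n_hidden_1": n_hidden_1,
        # The second hidden layer has at least as many units as the first.
        "n_hidden_2": rng.randint(n_hidden_1, 50),
    }


# Ten configurations, matching the number tried in the search.
configs = [sample_mistnet_config(random.Random(seed)) for seed in range(10)]
```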



The initial learning rate was held constant at 0.0005. For each hyperparameter configuration,
I performed five-fold cross-validation after 500 and 1000 seconds of model fitting. According to
cross-validation, longer training times improved performance, and an intermediate number of
Monte Carlo samples per iteration yielded the best overall performance. The best-performing
model had 15 hidden units in the second hidden layer.






References

Bergstra, J. & Bengio, Y. (2012) Random Search for Hyper-Parameter Optimization. J.
Mach. Learn. Res., 13, 281-305.

Hijmans, R.J., Williams, E. & Vennes, C. (2012) Geosphere: Spherical Trigonometry.

Murphy, K.P. (2012) Machine Learning: A Probabilistic Perspective. The MIT Press.

Neal, R.M. & Hinton, G.E. (1998) A view of the EM algorithm that justifies incremental,
sparse, and other variants. Learning in graphical models pp. 355-368. Springer.

Tang, Y. & Salakhutdinov, R. (2013) Learning Stochastic Feedforward Neural Networks.
Advances in Neural Information Processing Systems 26 (eds C.J.C. Burges, L. Bottou,
M. Welling, Z. Ghahramani & K.Q. Weinberger), pp. 530-538.






